PARALLEL VISUALIZATION OF LARGE-SCALE AERODYNAMICS CALCULATIONS: 

A CASE STUDY ON THE CRAY T3E 

KWAN-LIU MA* and THOMAS W. CROCKETT* 

Abstract. This paper reports the performance of a parallel volume rendering algorithm for visualizing a large- 
scale unstructured-grid dataset produced by a three-dimensional aerodynamics simulation. This dataset, containing 
over 18 million tetrahedra, allows us to extend our performance results to a problem which is more than 30 times 
larger than the one we examined previously. This high resolution dataset also allows us to see fine, three- 
dimensional features in the flow field. All our tests were performed on the SGI/Cray T3E operated by NASA's 
Goddard Space Flight Center. Using 511 processors, a rendering rate of almost 9 million tetrahedra/second was 
achieved with a parallel overhead of 26%. 

Key words. parallel rendering, volume rendering, scientific visualization, parallel algorithms, unstructured 
grids, computational fluid dynamics, T3E 

Subject classification. Computer Science 

1. Introduction. Leading-edge scientific computations with demanding memory and processing requirements 
are increasingly being performed on massively parallel supercomputers. As an example, researchers at ICASE and 
NASA Langley Research Center are performing large-scale unstructured mesh aerodynamics computations for 
three-dimensional aircraft configurations on state-of-the-art parallel supercomputers such as the Cray T3E and SGI 
Origin2000 [8]. The computational meshes they are using each contain several million data points. The largest one is 
a transport aircraft configured for take-off (high-lift, with flaps extended), which uses up to 24.7 million grid points 
to derive good predictions of lift and drag for varying angles of attack. 

Visualizing the solution data from this type of calculation is particularly challenging because the associated 
unstructured meshes are typically large in size and irregular in both shape and resolution. Figure 1.1 displays a sur- 
face mesh used in the high-lift analysis work [8], and Figure 1.2 shows a close-up view of the same mesh. The 
corresponding volume mesh would be too cluttered to view directly. Visualizing unstructured-grid data has been an 
active area of research in recent years [2, 5, 6, 7, 9, 10, 11, 12, 13]. However, interactive performance for 
high-fidelity visualization of large datasets, such as the high-lift analysis solutions, can only be obtained with the help of 
parallel computers. 

At ICASE, we have been developing a parallel volume renderer for unstructured-grid data. Our design is based 
on cell-projection rendering and a multiplexed asynchronous communication algorithm. Effective static load bal- 
ancing is achieved with a round robin distribution of volume data cells among the processors, combined with a fine- 
grained interleaved partitioning of the image. A spatial partitioning tree is used to ensure locality during the render- 
ing process, thereby improving the performance of the image compositing step and reducing memory consumption. 
Communication cost is reduced by buffering messages and by overlapping communication with rendering calcula- 
tions as much as possible. 

In [6], we show that this algorithm scales well with increasing numbers of processors on the IBM SP2. Parallel 
efficiencies of 70% or better were maintained for up to 128 processors. However, our tests used a relatively small 
dataset containing only 103,064 data points (567,863 tetrahedra). We did not know if the same algorithm would 
FIG. 1.1. Illustration of the surface grid for an aircraft in high-lift configuration. The mesh resolution near the wing surface is particularly 
fine. Over 90% of mesh elements are in the vicinity of the wing surface. 



FIG. 1.2. A close-up view of the surface grid for the high-lift configuration, which shows the mesh structure near the two leading edge slats. 


scale well with data size, or whether increased numbers of ray segments would lead to communication bottlenecks. 
The size of the high-lift analysis solution data allows us to verify the scalability of our algorithm for larger datasets. 
This paper presents test results for a dataset containing 3,107,075 grid points (18,216,138 tetrahedra), which is about 
32 times larger than the one we used previously. The size of this dataset also allows us to profitably increase the 
image resolution, providing an opportunity to study performance as a function of image size, and to produce 
visualization results which reveal fine details in the modeled phenomena. 





The rest of the paper is organized as follows. Section 2 reviews the basic parallel rendering algorithm. Section 3 
provides a brief overview of the T3E architecture. Section 4 then presents experimental results obtained on the Cray 
T3E using up to 511 processors. Section 5 illustrates some of the visualization results we have obtained with our 
renderer, and we conclude this case study in Section 6 with a summary of our results and directions for future 
research. 

2. A Parallel Volume Rendering Algorithm for 3D Unstructured-Grid Data. A more thorough description 
of our parallel volume rendering algorithm can be found in [6]. In this section, we only highlight the design 
principles. The basic algorithm performs a sequence of tasks: 

• Distributing data and visualization parameters 

• Space partitioning 

• Viewing transformation 

• Scan conversion of tetrahedral cells 

• Merging of ray segments 

• Assembly and output of final images 

The volume data is distributed in round robin fashion with the intention of dispersing nearby cells as widely as pos- 
sible among processors. This is because the data cells come in different sizes and shapes. The difference in size can 
be as much as several orders of magnitude due to the adaptive nature of the unstructured mesh. As a result, the pro- 
jected image area of a cell can vary dramatically, which produces similar variations in scan conversion costs. Cells 
which are in proximity tend to have similar sizes, so dispersing them helps to average out load imbalances due to 
cell size. 
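The averaging effect of round robin distribution can be illustrated with a minimal sketch. It assumes that cells are stored in mesh order, so that neighboring indices have similar scan conversion costs; the function names and the synthetic cost values are hypothetical and are not taken from the renderer. A contiguous block assignment is included only for comparison.

```python
def round_robin(num_cells, num_procs):
    """Assign cell i to processor i mod num_procs."""
    return [list(range(p, num_cells, num_procs)) for p in range(num_procs)]


def block(num_cells, num_procs):
    """Contiguous block assignment, shown only for comparison."""
    size = -(-num_cells // num_procs)  # ceiling division
    return [list(range(p * size, min((p + 1) * size, num_cells)))
            for p in range(num_procs)]


def loads(assignment, cost):
    """Total scan conversion cost assigned to each processor."""
    return [sum(cost[i] for i in cells) for cells in assignment]


def imbalance(per_proc):
    """Ratio of the heaviest load to the mean load (1.0 is perfect)."""
    return max(per_proc) / (sum(per_proc) / len(per_proc))


# Synthetic costs (an assumption): the first quarter of the cells are
# 100 times more expensive, mimicking a cluster of cells with large
# projected area.
num_cells, num_procs = 1000, 8
cost = [100.0 if i < num_cells // 4 else 1.0 for i in range(num_cells)]

rr_imbalance = imbalance(loads(round_robin(num_cells, num_procs), cost))
blk_imbalance = imbalance(loads(block(num_cells, num_procs), cost))
print(f"round robin: {rr_imbalance:.3f}, block: {blk_imbalance:.3f}")
```

With these synthetic costs the round robin assignment stays within a few percent of a perfectly even load, while the block assignment concentrates the expensive cells on two processors.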

Once the volume data is distributed, a preprocessing step performs a parallel, synchronized partitioning of the 
volume data to produce a hierarchical representation of the data space. We use a k-d tree [1] because of its ability to 
adapt to the structure of the data. The k-d tree is used in the rendering step to restore locality which is lost in the data 
distribution step, resulting in more efficient image compositing and reducing runtime memory requirements. 
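As an illustration of how a k-d tree can restore locality after a round robin distribution, the following serial sketch builds a tree over cell centroids using median splits and then visits each processor's local cells in a common leaf order. The median-split rule, the leaf size, and all names are assumptions made for the example; the renderer itself builds its tree in a parallel, synchronized step whose details are given in [6].

```python
def build_kd(points, indices=None, depth=0, leaf_size=2):
    """Recursively split point indices at the median of a cycling axis."""
    if indices is None:
        indices = list(range(len(points)))
    if len(indices) <= leaf_size:
        return {"leaf": indices}
    axis = depth % 3
    indices = sorted(indices, key=lambda i: points[i][axis])
    mid = len(indices) // 2
    return {
        "axis": axis,
        "split": points[indices[mid]][axis],
        "left": build_kd(points, indices[:mid], depth + 1, leaf_size),
        "right": build_kd(points, indices[mid:], depth + 1, leaf_size),
    }


def leaves_in_order(node):
    """Yield the leaf index lists in a fixed left-to-right order."""
    if "leaf" in node:
        yield node["leaf"]
    else:
        yield from leaves_in_order(node["left"])
        yield from leaves_in_order(node["right"])


def local_order(tree, rank, num_procs):
    """Cells owned by `rank` (round robin), in tree traversal order."""
    return [i for leaf in leaves_in_order(tree)
            for i in leaf if i % num_procs == rank]


# Hypothetical centroids at the corners of a unit cube.
centroids = [(x, y, z) for z in (0, 1) for y in (0, 1) for x in (0, 1)]
tree = build_kd(centroids)
all_leaves = list(leaves_in_order(tree))
rank0_cells = local_order(tree, rank=0, num_procs=2)
```

Because every processor walks the leaves in the same order, the ray segments produced at any given time tend to fall in the same region of the image, regardless of which processor generated them.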

The principal difference between the current algorithm and the one described in [6] is in the image partitioning 
strategy. Because the types of unstructured grids we deal with can have small regions of very high cell density, we 
found that our original scanline interleaving scheme exhibited load imbalances as the number of processors 
approached the number of scanlines. In contrast, the current algorithm uses a very fine-grained pixel interleaving 
scheme which effectively distributes high density regions over more processors, resulting in better load balancing 
and improved scalability. 
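The difference between the two image partitioning strategies can be sketched as follows. The exact interleaving pattern used by the renderer is not specified here, so the modular pixel mapping below is an assumption chosen for simplicity, and the function names are hypothetical.

```python
def scanline_owner(x, y, width, num_procs):
    """Scanline interleaving: every pixel on a row has the same owner."""
    return y % num_procs


def pixel_owner(x, y, width, num_procs):
    """Pixel interleaving: consecutive pixels go to consecutive processors."""
    return (y * width + x) % num_procs


def owners_of_region(owner, x0, y0, size, width, num_procs):
    """Set of processors that own at least one pixel of a square region."""
    return {owner(x, y, width, num_procs)
            for y in range(y0, y0 + size)
            for x in range(x0, x0 + size)}


# A small, dense 8 x 8 pixel region in a 400 x 400 image on 64 processors.
width, num_procs = 400, 64
scan_set = owners_of_region(scanline_owner, 100, 200, 8, width, num_procs)
pix_set = owners_of_region(pixel_owner, 100, 200, 8, width, num_procs)
print(len(scan_set), len(pix_set))
```

In this example the dense region is shared by 8 processors under scanline interleaving and by 32 under the assumed pixel interleaving, which is the effect the finer-grained scheme is intended to achieve.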

A cell projection method is used to render the volume data. However, cells are not pre-sorted in depth order. 
Instead, each processor traverses the k-d tree in the same fixed order, scan converting its local cells to produce ray 
segments. The ray segments are then routed to their final destinations in image space for merging. A double-buffer- 
ing scheme is used in conjunction with asynchronous send and receive operations to amortize communication 
overheads and to overlap communication of ray segments with rendering computations. Scan conversion of data 
cells and merging of ray segments proceed together in multiplexed fashion. When scan conversion and ray-segment 
merging are finished, each processor sends its completed subimage to a host processor which assembles them for 
display. 

Although the Goddard T3E contains 1048 computational processors, the per-job limit is 512. Our current implementation uses one processor to 
coordinate data distribution and image assembly tasks, leaving a maximum of 511 processors available for rendering computations. 

Logically, the scan conversion and image merging operations represent separate threads of control, operating in 
different computational spaces and using different data structures. For the sake of efficiency and portability, 
however, we have chosen to interleave these two operations using a polling strategy. Each processor starts by scan 
converting one or more data cells. Periodically the processor checks to see if incoming ray segments are available; if 
so, it switches to the merging task, sorting and merging incoming rays until no more input is pending. The resulting 
communication pattern is both view- and data-dependent, but generally requires each processor to communicate 
with most, if not all, of the other processors. 
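The polling strategy can be sketched for a single processor as follows. This is a simplified, serial illustration: a `deque` stands in for the queue of ray segments delivered by asynchronous receives, and `render_frame`, `scan_convert`, and `merge` are hypothetical names rather than routines from the renderer.

```python
from collections import deque


def render_frame(local_cells, inbox, polling_interval, scan_convert, merge):
    """Interleave scan conversion with merging of incoming ray segments.

    Returns the number of polling operations performed.
    """
    polls = 0
    for count, cell in enumerate(local_cells, start=1):
        scan_convert(cell)
        if count % polling_interval == 0:
            polls += 1
            while inbox:  # merge until no more input is pending
                merge(inbox.popleft())
    # Final drain. A real renderer would instead keep polling until the
    # termination protocol reports that the frame is complete.
    while inbox:
        merge(inbox.popleft())
    return polls


converted, merged = [], []
inbox = deque(["seg_a", "seg_b", "seg_c"])  # stand-in for received segments
polls = render_frame(range(1000), inbox, 200, converted.append, merged.append)
print(polls, len(converted), len(merged))
```

A single loop of this kind avoids the need for a threads package, at the cost of having to choose a polling interval.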

Due to the asynchronous nature of the rendering algorithm, individual processors are not able to determine on 
their own when a frame is complete. Hence a distributed termination detection protocol is employed. Our original 
renderer used a straightforward procedure in which the host processor collected information from each rendering 
processor and then notified them all when it determined that the overall rendering operation was complete. Our cur- 
rent version improves on this using a binary merging algorithm based on ray-segment counts. The new approach 
runs in logarithmic, rather than linear, time, and does not involve the host, making it more efficient and scalable to 
larger numbers of processors. 
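The logarithmic behavior of a binary merging scheme can be illustrated with a small serial simulation. The sketch assumes that each processor keeps counts of the ray segments it has sent and received and that a frame is complete when the global totals agree; this completion test and all names are our assumptions for the example, not a description of the renderer's actual protocol.

```python
def binary_merge_counts(sent, received):
    """Combine per-processor counts pairwise, doubling the stride each round.

    Returns (total_sent, total_received, rounds).
    """
    totals = list(zip(sent, received))
    n = len(totals)
    rounds = 0
    stride = 1
    while stride < n:
        # Processor i absorbs the partial sums held by processor i + stride.
        for i in range(0, n - stride, 2 * stride):
            s0, r0 = totals[i]
            s1, r1 = totals[i + stride]
            totals[i] = (s0 + s1, r0 + r1)
        stride *= 2
        rounds += 1
    return totals[0][0], totals[0][1], rounds


total_sent, total_received, rounds = binary_merge_counts(
    [3, 1, 4, 1, 5], [2, 7, 1, 2, 2])
frame_complete = total_sent == total_received

# With 511 rendering processors the merge needs only 9 rounds.
_, _, rounds_511 = binary_merge_counts([1] * 511, [1] * 511)
```

A host-based scheme would instead collect one message from every rendering processor, so its cost grows linearly with the processor count.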

We have identified a dozen different variables which can affect the performance of this algorithm on any given 
architecture. Some of these depend on the contents of the input data; others are determined by the viewing and visu- 
alization parameters specified by the user; and still others are parameters of the algorithm. We will discuss several of 
these issues in more detail in Section 4, but first we provide a brief overview of the T3E architecture. 

3. SGI/Cray T3E. All of the tests reported here were performed on the SGI/Cray T3E computer operated by 
NASA’s Goddard Space Flight Center. The T3E is a distributed-memory massively parallel computer system. 
Although memory is attached directly to each processor (physically distributed), it is globally addressable. In the 
interest of portability, we have chosen not to exploit this feature directly, preferring instead to rely on MPI message 
passing for interprocessor communication. 

In Goddard's T3E, each PE contains a 300 MHz DEC Alpha 21164 microprocessor with peak performance of 
600 MFLOPS, and 128 megabytes of local memory. About 120 megabytes per processor can be used by the application 
program. The system as a whole contains 1088 processors, of which 1048 are available for application 
workloads, with a per-job maximum of 512 PEs. All PEs are connected by a bidirectional 3D torus communication 
network with peak data bandwidth of 480 megabytes per second in every direction. A recent cross-platform study of 
a parallel polygon renderer with communication characteristics similar to those of our volume rendering algorithm 
concluded that the T3E delivered performance which was superior to that of its contemporary competitors [3]. 

4. Test Results. To study the scalability of our rendering algorithm, we performed a series of tests using both 
the large dataset containing 18,216,138 tetrahedra, and the small dataset containing 567,863 tetrahedra. We used the 
small dataset in our previous study [6] to examine in detail each component of the parallel overhead and to fine tune 
our algorithm. In this study, we focus particularly on the following parameters: 

• number of processors 

• data size 

• image size 

• ray-segment buffer size 

• polling frequency 

For each test, three different views were used, and the average rendering time was recorded. Color Plates 4.1 and 4.2 
show the view sequences used with the small and large datasets respectively. With the current steady-state solutions, 
data input and distribution is performed only once, and is therefore not included in the rendering time. To 
expedite our experiments, the large dataset used in our tests has been reduced from double precision to single preci- 
sion, and contains only a single scalar quantity at each grid point. It occupies 325 megabytes of space on disk; 
[Figure: Comparative Rendering Performance. 400 x 400 image, fixed buffer size and polling interval. Curves: 567,862 cells 
(buffer size = 80, polling interval = 160) and 18,216,138 cells (buffer size = 100, polling interval = 200).] 

FIG. 4.1. Rendering rates (tetrahedra/second) on the T3E using up to 511 processors. 


reading and partitioning it among 128 processors requires approximately 33 seconds. We have made no attempt to 
optimize our image assembly and display procedures, so these times are excluded from the rendering rates as well. 

4.1. Performance and scalability. The first set of test results is summarized in Figure 4.1. We compare the 
rendering rates for the large and the small datasets. Because of memory requirements, a minimum of 42 processors 
is needed to render the large dataset. The plots show that performance with the large dataset increases steadily 
through 511 processors. The large number of tetrahedral cells entails enough computational load to keep the 
parallelization overhead manageable. Performance with the small dataset is also good through 256 processors, but 
peaks around 320 processors and deteriorates beyond that point. With 128 processors, our current implementation on 
the T3E renders the small dataset more than 4 times faster than its predecessor on the IBM SP2 [6]. 

Figure 4.2 compares the relative contributions of computation and parallel overhead to the total rendering time 
for varying numbers of processors. Computation includes frame initialization, tree traversal, scan conversion, ray- 
segment merging, and control flow. Overhead represents additional costs incurred due to the parallel implementa- 
tion, and includes data copying, communication, termination detection, and idle time due to load imbalance and 
network congestion. For the large dataset, the overheads comprise about 17% of the time on 64 processors, gradually 
increasing to 26% of the time with 511 processors. This slow rate of growth suggests that even more processors 
could be used effectively for rendering datasets of this size. For the smaller dataset, useful computation drops below 
50% at about the same point where performance peaks in Figure 4.1, around 320 processors. 

Table 4.1 provides a more detailed breakdown of the computation and overhead components for the 511-processor 
case. Ray-segment merging is by far the most expensive operation, costing more than four times as much as scan 
[Figure: Computation Time vs. Overhead. 400 x 400 image, fixed buffer size and polling interval. Horizontal axis: No. of 
Processors. Bars: computation and overhead for 567,862 cells and for 18,216,138 cells.] 

FIG. 4.2. Computation time and parallel overheads for varying numbers of processors. Buffer sizes and polling intervals are the same as in 
Figure 4.1. 


conversion. Communication costs (data copying, send and receive latencies, polling, and synchronization) seem to 
be well under control. The dominant overhead appears to be the end-of-frame termination detection protocol, but 
this is partly an artifact of the way termination time is measured. Once a processor has finished scan converting all 
of its cells, it enters a polling loop, waiting for either incoming ray segments from other processors, or termination 
messages. The test for the latter requires a trip through the termination detection routine on each iteration of the 
loop, so that much of the reported termination cost could be interpreted instead as polling overhead and/or idle time 
(receive wait). Given this interpretation, the primary overhead then becomes wait time, which is mainly a reflection 
of load imbalance. 

Together, the results in Figures 4.1 and 4.2 and Table 4.1 demonstrate that the T3E is well-equipped to handle 
the massive communication generated by this application. With the large dataset on 511 processors, an average of 
more than 44.6 million ray segments have to be communicated per frame. At 24 bytes per ray segment, the aggregate 
communication volume is about 1.1 gigabytes. One of the advantages of our algorithm is that this load does not 
get injected into the network all at once, but is spread out over the duration of the frame time, determined in part by 
the choice of buffer size. Nonetheless, with an advertised bisection bandwidth of 122 GB/s in a 512-processor con- 
figuration, it seems unlikely that we would ever tax the communication capabilities of the T3E. 
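The figures quoted above can be checked with a short back-of-the-envelope calculation. The frame time is the total from Table 4.1; the comparison with bisection bandwidth assumes, purely for illustration, that the traffic is spread uniformly over the frame, and it ignores the fact that only part of the traffic actually crosses the bisection. Variable names are ours.

```python
ray_segments = 44.6e6          # average ray segments per frame, 511 processors
bytes_per_segment = 24
frame_time = 2.035252          # seconds, total time from Table 4.1
tetrahedra = 18_216_138
bisection_bandwidth = 122e9    # bytes/second, advertised for 512 processors

volume_bytes = ray_segments * bytes_per_segment
average_rate = volume_bytes / frame_time       # bytes/second over the frame
fraction_of_bisection = average_rate / bisection_bandwidth
rendering_rate = tetrahedra / frame_time       # tetrahedra/second

print(f"communication volume: {volume_bytes / 1e9:.2f} GB")
print(f"average rate:         {average_rate / 1e6:.0f} MB/s")
print(f"share of bisection:   {fraction_of_bisection:.2%}")
print(f"rendering rate:       {rendering_rate / 1e6:.2f} million tetrahedra/s")
```

Under these assumptions the average communication rate is well under one percent of the advertised bisection bandwidth, and the rendering rate agrees with the figure of almost 9 million tetrahedra/second.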

4.2. Image size. While an image size of 400 x 400 may provide adequate resolution for smaller volumetric 
datasets, it doesn't do justice to the larger problem considered here. To gauge the impact of higher image resolutions 
on performance, we have repeated our experiments with image sizes up to 1200 x 1200 pixels. The results are shown 
in Table 4.2. As can be seen, performance tracks the increase in image resolution closely. This is to be expected 
given that the principal execution time components are dependent on the number of ray segments generated. The 
lower overhead fraction for the 1200 x 1200 case is due to a larger ray segment buffer (1,000 vs. 100), which results 
TABLE 4.1 
Execution time components for 18,216,138 cells on 511 processors at 400 x 400 pixels. 

Component                   Time (secs)    Percentage 

Overheads 
  Ray-segment copying                          1.98 
  Send latency               0.046632          2.29 
  Receive latency            0.042881          2.11 
  Send wait                  0.000000 
  Receive wait                                 7.94 
  Polling                                      1.77 
  Termination detection      0.187978          9.24 
  Synchronization            0.013291          0.65 
  Total overhead             0.528805         25.98 

Computation 
  Initialization                               4.81 
  Scan conversion            0.235613         11.58 
  Ray-segment merging 
  Other                      0.223697         10.99 
  Total computation          1.506447         74.02 

Total                        2.035252        100.00 


TABLE 4.2 
Rendering performance at different image resolutions using 128 processors with the large (18.2 million cell) dataset. Times are in seconds. 

Image Size                  400 x 400    800 x 800    1200 x 1200 

Ray segments (millions)        44.7        178.0         400.0 
Overhead time                  1.111        4.355         5.613 
Compute time                   4.711       16.865        39.721 
Total time                     5.822       21.220        45.334 
Overhead percentage            19.1%        20.5%         12.4% 


in more efficient communication with this high resolution image. The polling interval was set to 200 for all three 
cases. 
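The relationships among the entries of Table 4.2 can be verified with a few lines of arithmetic. The values are copied from the table; the dictionary layout and variable names are ours.

```python
# image edge length -> (ray segments in millions, overhead time, total time)
table_4_2 = {
    400: (44.7, 1.111, 5.822),
    800: (178.0, 4.355, 21.220),
    1200: (400.0, 5.613, 45.334),
}

base_segments, _, base_total = table_4_2[400]
summary = {}
for edge, (segments, overhead, total) in table_4_2.items():
    summary[edge] = {
        "pixel_ratio": (edge / 400) ** 2,
        "segment_ratio": segments / base_segments,
        "time_ratio": total / base_total,
        "overhead_pct": round(100.0 * overhead / total, 1),
    }

for edge, row in summary.items():
    print(edge, row)
```

The number of ray segments grows almost exactly in proportion to the number of pixels (factors of about 4 and 9), while total time grows slightly more slowly.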

4.3. Communication parameters. As our results from Section 4.2 suggest, the choice of communication 
parameters (buffer depth and polling interval) can have a significant impact on performance. Although this problem 
has been studied in some detail in the context of parallel polygon rendering algorithms [4], the situation here is more 
complex due to the interaction of additional parameters such as the depth of the k-d tree and the choice of opacity 
transfer functions, both of which can have a significant impact on communication and computation performance. 
Thus guidelines for selecting optimal communication parameters are far from obvious, and a detailed analysis is the 
subject of ongoing investigation. 

Figure 4.3 displays the impact of buffer depth on performance with the large dataset using 192 processors. The 
buffer size varies from a minimum of 25 up to 1.25 times the expected useful maximum. We define the expected useful 
maximum as the average number of ray segments which need to be communicated from each processor to every 
[Figure: Rendering Time vs. Buffer Depth. 192 processors, 18.2 million cells, GSFC T3E.] 

FIG. 4.3. Rendering time as a function of buffer depth for two different polling strategies. Image size is 400 x 400. 


other processor. This value is highly problem-dependent, varying as a function of the input data, viewing parame- 
ters, opacity mapping, and number of processors. For this case, the expected useful maximum is empirically 
determined to be an even 1200. 

Normally one expects that as buffer size increases, performance will improve, since message-passing over- 
heads are amortized over more data items. However, using a buffer size which is too large can eliminate the benefits 
obtained by spreading the communication load over time, particularly on bandwidth-limited systems [4]. For buffer 
sizes at or above the expected useful maximum, the behavior is equivalent to a simpler algorithm in which all of the 
communication is deferred to the end of the frame, and the advantages of the asynchronous approach are lost. 
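The amortization argument can be made concrete with a simple message-count model. It assumes that every pair of processors exchanges exactly the expected useful maximum of 1200 ray segments and that a buffer is sent only when it is full or at the end of the frame; real traffic is uneven, so the model is only indicative. The function names are hypothetical.

```python
import math


def messages_per_pair(segments_per_pair, buffer_size):
    """Messages needed to send the segments destined for one processor."""
    return math.ceil(segments_per_pair / buffer_size)


def all_deferred(segments_per_pair, buffer_size):
    """True when the buffer never fills before the end of the frame."""
    return buffer_size >= segments_per_pair


expected_useful_max = 1200   # ray segments per processor pair, 192 processors
num_procs = 192
pairs = num_procs * (num_procs - 1)

counts = {b: messages_per_pair(expected_useful_max, b)
          for b in (25, 100, 600, 1200, 1500)}
total_messages = {b: pairs * c for b, c in counts.items()}
deferred = {b: all_deferred(expected_useful_max, b) for b in counts}
```

In this model, increasing the buffer from 25 to 100 segments cuts the message count by a factor of four, but at 1200 segments and above each pair exchanges a single message at the end of the frame and no overlap with rendering remains.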

This loss of performance with increasing buffer size can be clearly seen in Figure 4.3, although it appears at 
smaller buffer depths than would be expected given the high communication bandwidth on the T3E. We suspect that 
this premature degradation is caused by loss of locality in the ray merging operations. Larger buffers will tend to 
defeat the purpose of the k-d tree by delivering many ray segments at once which fall on a wider area of the image. 
Fewer opportunities for early ray merging arise, and ray segment lists will grow longer, with a corresponding 
increase in list insertion time and memory consumption. Note, however, that the vertical scale on the graph has been 
chosen to highlight the effect — the total variation in performance is only about 16%. 

Our experience with parallel polygon renderers indicates that the choice of polling interval is far less critical 
than buffer depth, and the results here seem to bear that out. The interval between polling operations should be big 
enough to amortize the cost of the polling call, but beyond that, just about any value will do. This is in part due to 
the deadlock avoidance properties of our algorithm. If a processor is blocked from sending, it automatically reverts 
to a receiving mode, whether or not the polling interval has been reached. 

Figure 4.3 also shows the results of two different polling strategies. In the first one, we pick a fixed polling 
interval of 200, i.e., the renderer will check for incoming data after scan converting 200 cells. In the second strategy, 
we set the polling interval to twice the buffer size. As can be seen, performance is very similar in either case, 
although the fixed polling interval is better behaved with larger buffer sizes. 

5. Visualization Results. Finally, we show a few visualization examples with the large dataset. Color Plate 5.1 
shows direct volume rendering of flow density surrounding the aircraft's wing. Color Plates 5.2 and 5.3 show 
visualizations of velocity magnitude. Direct volume rendering of this high resolution data elicits many fine features 
in the flow field which would be invisible with conventional two- or three-dimensional contour plots. In particular, 
in Plates 5.2 and 5.3 we can clearly verify the low pressure region (red spherical cloud) above the wing, and the high 
pressure region (yellow and orange blobs) below the wing. These two images also show the extreme low velocity 
values on the flaps (white stripes), and the complex flow patterns ahead of the leading edge and behind the trailing 
edge of the wing. None of these detailed phenomena could be seen with either low resolution data or low resolution 
rendering. 

Some additional white patches appear as intermittent linear features near the upper and lower edges of the fuse- 
lage surface (which is also a grid boundary). We have yet to determine whether these artifacts are generated by the 
simulation, or are due to numerical problems in the renderer. Although the simulation produces double precision 
results, the dataset used for our tests has been reduced to single precision in order to save disk space, reduce I/O 
time, and conserve memory in the renderer. It is conceivable that this loss of precision is causing erroneous values to 
appear during the scan conversion process. 

6. Conclusions. We have conducted a series of performance tests with one of the largest unstructured-grid 
datasets used to date in parallel volume rendering research. Performance and scalability are good, and the T3E 
appears to be very well suited to the task. In particular, initial concerns about the volume of communication 
generated by such a large dataset appear to be unfounded, at least for this architecture. The primary impediment to 
interactive performance is ray merging time, suggesting that the additional communication and memory needed to 
support some form of early ray termination might be worth the cost. However, the visualizations shown here have 
very few truly opaque cells in them, so the utility of early ray termination is likely to be problem-dependent. 

While the results presented here show good performance, they are most likely sub-optimal, given the complexi- 
ties of tuning the algorithm to a particular dataset and a particular architecture. For maximum performance with 
minimum user intervention, a predictive, adaptive self-tuning strategy is needed so that the renderer can respond 
dynamically to changes in the input data, viewing parameters, or hardware configuration. 

We also plan to study an even larger dataset which contains nearly 150 million tetrahedra. It is clear that direct, 
brute-force rendering will not provide interactive response for datasets of this size, even with massively parallel 
architectures. Consequently, we are investigating ways to integrate a multiresolution scheme into the rendering step. 

In addition, a new generation of the high-lift analysis code uses mixed grids composed of prisms, pyramids, tet- 
rahedra, and hexahedra in order to achieve higher efficiency. Our cell-projection rendering algorithm can be easily 
generalized to handle such mixed grids. 

We have also conducted preliminary tests with the same algorithm on SGI's Origin2000 architecture. Our initial 
results were poor compared to the T3E, consistent with the findings in [3]. It is possible that our algorithm fares 
poorly on distributed shared memory architectures, or that deficiencies in memory management or message passing 
software are inhibiting scalability. We are designing new experiments for a more comprehensive study. 

In the meantime, the renderer has also been ported to ICASE’s 32-node Linux-based PC cluster, where it out- 
performs the T3E for moderate numbers of processors. To render the large high-lift dataset at 400 x 400 resolution, 
the T3E requires a minimum of 42 processors and averages 15.9 seconds per frame; on the cluster, 31 processors 
accomplish the same task in 15.4 seconds. Given that communication performance is much lower in the cluster, this 
difference is attributable primarily to faster processors (400 MHz Pentium IIs), and we would expect performance 
scalability to be much more limited than on the T3E. 
Plate 4.1. Test sequence for the small dataset (567,862 cells). Flowfield over an aircraft wing with a missile attached. 

Plate 4.2. Test sequence for the large dataset (18,216,138 cells). Velocity field around a high-lift wing configuration. 

Plate 5.1. Density; a view from above the wing. 

Plate 5.2. Velocity magnitude; a view from the fuselage toward the tip of the wing. 

Plate 5.3. Velocity magnitude; a view from above and behind the wing. 